Chapter 9 MAUP problem and regression

9.1 Learning objectives

  1. Describe, explain and visualise the MAUP problem
  2. Design and use loops and functions
  3. Execute linear, Ridge and LASSO regression to predict values (e.g. future or missing)
  4. Critically evaluate different regression approaches

9.2 Introduction

The Modifiable Areal Unit Problem (MAUP) represents the related effects of scale and aggregation on all geographic data and analysis. It was first demonstrated by the geographer Stan Openshaw in 1984, who showed that as you aggregated results to different spatial units the results could be manipulated to show different outcomes. Throughout this practical book we’ve considered London boroughs, but would any of the results change if we considered the data at ward level? Are we hiding any trends because we have just used the borough data that summed all of the wards within the borough? It’s important to consider what is the most appropriate spatial unit for your analysis and provide appropriate reasoning. Even if this is pretty straightforward (e.g. the data was only provided at the borough level) you must contextualise this and describe any potential limitations of it. In this practical I will firstly demonstrate the MAUP in action using some more advanced R code. Then we will have a look at some techniques to model and validate data.

Here we will be using London borough and ward data from practical 1. As we’re getting better with R, we will try to automate almost everything — meaning that if you gave this code to someone else they could just run it without any data files and generate the same result. The only thing we won’t automate later on is loading an Excel file… I did find a function online that would let us read it from the internet but it was more hassle than it was worth. I don’t know why the data isn’t just also distributed as a .csv.

9.3 MAUP

9.3.1 Get the data

  1. Download and unzip the London statistical gis boundaries.
  1. Take the downloaded data and filter it based on the filenames that contain Borough OR Ward_ AND .shp using grepl()
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE

It just gives an index of where the values are equal to what we specified.
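A minimal sketch of that pattern match, assuming the filenames sit in a character vector (here a few made-up names stand in for the real unzipped file list):

```r
# hypothetical stand-ins for the unzipped filenames
listfiles <- c("London_Borough_Excluding_MHW.shp",
               "London_Borough_Excluding_MHW.dbf",
               "London_Ward_CityMerged.shp",
               "readme.txt")

# TRUE where the name contains Borough OR Ward_ AND ends in .shp
shapefiles <- grepl("Borough|Ward_", listfiles) & grepl(".shp$", listfiles)
shapefiles
```

The logical vector that comes back is what we then use to subset the original file list.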

  1. We then need to use it to subset from our original data…
  1. Now read in both of the files using lapply(), which applies a function (here st_read()) to a list
## Reading layer `London_Borough_Excluding_MHW' from data source `C:\Users\ucfnmac\OneDrive - University College London\Teaching\CASA0005repo\prac10_data\statistical-gis-boundaries-london\ESRI\London_Borough_Excluding_MHW.shp' using driver `ESRI Shapefile'
## Simple feature collection with 33 features and 7 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 503568.2 ymin: 155850.8 xmax: 561957.5 ymax: 200933.9
## epsg (SRID):    NA
## proj4string:    +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.999601272 +x_0=400000 +y_0=-100000 +datum=OSGB36 +units=m +no_defs
## Reading layer `London_Ward_CityMerged' from data source `C:\Users\ucfnmac\OneDrive - University College London\Teaching\CASA0005repo\prac10_data\statistical-gis-boundaries-london\ESRI\London_Ward_CityMerged.shp' using driver `ESRI Shapefile'
## Simple feature collection with 625 features and 7 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 503568.2 ymin: 155850.8 xmax: 561957.5 ymax: 200933.9
## epsg (SRID):    NA
## proj4string:    +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.999601272 +x_0=400000 +y_0=-100000 +datum=OSGB36 +units=m +no_defs

To map or access each individual shapefile it’s just…
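As a base-R illustration of how lapply() and the [[ indexing work (the toy list here stands in for the vector of shapefile paths, and toupper() stands in for st_read()):

```r
paths <- list("borough.shp", "ward.shp")  # stand-ins for the real file paths
shapes <- lapply(paths, toupper)          # st_read() would go here instead
shapes[[1]]                               # access the first element with [[
```

lapply() always returns a list of the same length as its input, so `shapes[[1]]` would be the boroughs and `shapes[[2]]` the wards.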

  1. Get the data for Airbnb
  1. And for OSM we’ll download it from Geofabrik. You can also use the OSM Application Programming Interface (API), but there is a limit on the number of points downloadable per call, so you’d have to do something a bit more complicated to get the whole of London… however, I have provided an example of the API call.

Example of using the API…

9.3.4 Loops

  1. Ok, but we want to get a count of Airbnb points per London ward and borough… how can we do that? Well, manually of course… like this…

But we can also automate this using a loop (either a while or for loop). I’ve used a while loop here as when I did my MSc you weren’t able to put a for loop inside a for loop. I believe that has now changed, but because of the day I once had to spend changing everything, I always default to using a while loop.

Tell us what a loop is already?

A loop lets you go over something, adding 1 (or any value) each time… for example, let’s look at a basic loop. You need to run everything in the loop at once, from the while to the }. If you make just a normal R script you can set breakpoints — the code will then stop each time it hits the breakpoint within the loop. You can’t do this at the moment with RMarkdown code chunks, so I normally develop the loop outside of a code chunk, then put it all together.
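A minimal version of such a while loop, assuming the counter variable is called basicloop:

```r
basicloop <- 1                 # starting value
while (basicloop < 6) {        # keep going while the counter is below 6
  print(basicloop)             # show the current value
  basicloop <- basicloop + 1   # add 1 each time round
}
```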

## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5

That loop outputs the values 1 to 5: we started with a value of 1, then added 1 to make 2. It remained below 6 so the code ran again, printing the 2, then added 1 again to make 3, and so on. As we specified less than 6, it stopped there.

We can also save these results to different variables, but we need to make a list (or data frame, or whatever you need) to start with to save them in.

## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5

Here we are using the variable basicloop to index our emptylist, so every time we add 1 it changes the index value… have a look at what I mean…
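A sketch of that indexing, assuming the list is called emptylist:

```r
basicloop <- 1
emptylist <- list()                      # empty list to store the results
while (basicloop < 6) {
  emptylist[[basicloop]] <- basicloop    # the counter doubles as the index
  basicloop <- basicloop + 1
}
emptylist
```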

## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3
## 
## [[4]]
## [1] 4
## 
## [[5]]
## [1] 5
## [1] 1
## [1] 2

Right, so how are we going to apply this to our data? We have two .shp files (boroughs and wards) in a list that we want to apply our function to…

Firstly let’s set up the length to stop at, make an empty list and a starting point. As our data is in a list we just want the length of that (as a number)…

Now here is the loop…

When should I use a loop?

Well, that’s a hard question… generally loops are considered inefficient in R, but I’m yet to come across a method that will let you increment one variable whilst keeping another consistent… by this I mean like what we did in our loop: we changed what spatial data was joined with the Airbnb data, while the Airbnb data remained the same. There are a few other functions like mapply(), sapply() and tapply() that work in the same format as lapply(), but to my knowledge they will still increment all of the variables at the same time.
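One pattern worth knowing: wrapping the call in an anonymous function lets lapply() vary its first argument while anything else stays fixed. A toy sketch (the strings here are hypothetical stand-ins for the spatial objects and the Airbnb points):

```r
spatial_list <- list("boroughs", "wards")  # stand-ins for the two sf objects
airbnb <- "airbnb points"                  # stays the same every iteration

# only x changes each time; airbnb is captured from the environment
joined <- lapply(spatial_list, function(x) paste(x, "joined with", airbnb))
joined[[2]]
```

In the real loop the paste() call would be replaced by the spatial join.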

9.3.5 Advanced mapping (again)

  1. Right, so we can sort of see the difference between the spatial levels (boroughs and wards), but let’s take a closer look within Westminster… here is the ‘preamble’ for the leaflet map… basically all the stuff we need to set it up…
  1. Now let’s map it using what we just specified… I’ve added a few more features than were in the Map making practical

Have a look around the map…Westminster borough uses a scale considering all other London borough values, whilst the ward scale is specific to Westminster. Use the following code to explore the values…

## [1]  159 1215
## [1] 9410

9.4 Regression relationships

Warning The data used within this practical is purely for demonstration purposes!

9.4.1 Preprocessing

In this part of the practical we’re going to try and model the relationship between the Airbnb counts at ward level and other variables. Our mini investigation here is to see if it’s possible to produce a statistically valid and rigorous model for predicting Airbnb values at the ward level. This type of analysis might be useful if you had missing or limited data at one spatial level (e.g. wards) but had a more complete dataset at a larger spatial scale (e.g. boroughs). Essentially, we’re trying to use the data at the borough level to give us an estimate of the data at the ward level. Obviously this isn’t an issue for us here, but if you ever use survey data or any kind of count data (e.g. health) it may well be limited to specific areas…

This section will also show you how to make/ use different regression models in R. Regression aims to find a mathematical equation to predict a value (in this case Airbnb at ward level) from one (or more) predictor variables (e.g. borough data). So we’ll use values of X (borough data) to predict values of Y (ward data)…

This can be represented with the equation:

\[Y = \beta_{1} + \beta_{2} X + \epsilon\] Where \(\beta_{1}\) is the intercept (expected value of Y when X=0, also called bias in machine learning), \(\beta_{2}\) is the slope of the line and \(\epsilon\) is the error term, the part that the model is unable to explain…

  1. To start with we’re going to crop our wards data to our boroughs data. In the borough data you can see the River Thames, which isn’t in the wards data. I don’t think this is technically required, but it’s good practice to make sure your datasets align, and if they don’t, to do something about it…

  1. Now we need to join our wards data and borough data together to give us a borough value for each ward within…

9.4.2 Scatter plot

  1. To start with let’s just visualise the relationship between the borough and ward counts using a scatter plot. If you have multiple predictor variables (explained later on) a plot will be drawn for each of them…

The scatter plot shows some kind of linear relationship. The point variation around borough count is because, for each borough, the wards will have a range of values… like we saw in the interactive map, Westminster borough had a value of 9410 and the wards ranged from 159 to 1215.

9.4.3 Outliers

  1. Generally, any point that is outside 1.5 × the interquartile range (IQR) is considered an outlier (a data point that differs significantly from all others), and you could use something like the methods here to remove and replace them… the IQR is the distance between the 25th and 75th percentiles.
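That rule is easy to sketch in base R on some hypothetical counts (the numbers here are made up for illustration):

```r
counts <- c(10, 12, 11, 13, 12, 11, 95)   # 95 is the suspect point
q <- quantile(counts, c(0.25, 0.75))      # 25th and 75th percentiles
iqr <- q[2] - q[1]                        # same as IQR(counts)
lower <- q[1] - 1.5 * iqr                 # anything below is an outlier
upper <- q[2] + 1.5 * iqr                 # anything above is an outlier
outliers <- counts[counts < lower | counts > upper]
outliers
```

Here only the 95 falls outside the fences; in practice you would then decide whether to remove or replace it.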

9.4.4 Correlation

  1. Correlation lets us see the level of linear dependence between two variables, measured between -1 and 1. High values (towards 1) mean that as x increases, y increases. Low values (towards -1) mean that as x increases, y decreases.
## 
##  Pearson's product-moment correlation
## 
## data:  joinednames$Boroughcount and joinednames$Wardcount
## t = 31.054, df = 622, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7468991 0.8086811
## sample estimates:
##       cor 
## 0.7796805
  1. Have a look back at the Statistical summary under Advanced raster analysis to see what all these values mean. Basically we’ve got a strong relationship that is statistically significant.

9.4.5 Linear model

  1. Now let’s use a linear model to establish a relationship between predictor and response with the function lm(). Here we are calling lm(), then stating the formula, then the data
## 
## Call:
## lm(formula = Wardcount ~ Boroughcount, data = joinednames)
## 
## Coefficients:
##  (Intercept)  Boroughcount  
##      -3.8188        0.0541
## 
## Call:
## lm(formula = Wardcount ~ Boroughcount, data = joinednames)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -346.25  -41.02   -3.45   20.50  709.75 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.818776   6.099543  -0.626    0.531    
## Boroughcount  0.054098   0.001742  31.054   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 104.9 on 622 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.6079, Adjusted R-squared:  0.6073 
## F-statistic: 964.3 on 1 and 622 DF,  p-value: < 2.2e-16

So from this output we have the coefficients of intercept and Boroughcount, going back to our formula from earlier this would mean…

\[Wardcount = -3.8188 + 0.0541*Boroughcount\]
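As a quick check, plugging Westminster’s borough count (9410, from the interactive map earlier) into that fitted equation:

```r
intercept <- -3.8188
slope <- 0.0541
boroughcount <- 9410                        # Westminster's Airbnb count
predicted_ward <- intercept + slope * boroughcount
predicted_ward                              # predicted ward-level count
```

The prediction sits within the 159–1215 range we saw for Westminster’s wards, which is reassuring, though a single point prediction obviously can’t capture that spread.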

9.4.6 Linear model outputs

p-value

A linear model is only statistically significant when both p-values are <0.05 within the summary call. There are two values, as one is for the coefficient (is the specific coefficient significant?) and the other is for the model (is the model significant?). Here, as they both are, this allows us to accept the alternative hypothesis that the coefficients are not equal to 0 (and therefore there is a relationship between the independent and dependent variables).

R-squared

R squared represents the proportion of variance of the dependent variable that is explained by the independent variable. It is different from correlation, which is the strength of the relationship!

But there are a few issues with R squared — every time you add a predictor to the model R squared will increase, and if there are too many predictors it will model random noise.

Adjusted R squared

Adjusted R squared therefore considers the number of variables within the model, increasing only if the new term assists by more than just chance.

Standard error

The coefficient standard error represents the average amount that the coefficient estimates vary from the actual average value of our response variable. It basically shows the expected range if we were to model again and again.

t

The coefficient t-value shows how many standard deviations the coefficient estimate is away from 0; the further away it is, the more likely it is we can reject the null hypothesis.

Residual error

Residual standard error is a measure of the quality of fit of the linear regression line. It is the average amount by which the points differ from our line.

F

The F-statistic: basically, the further from 1 it is, the better.

See this guide for more information on the outputs

9.4.7 Validation

  1. We’ve made a regression model using all of our data, but there is no way to test its validity… what if we take 20% as a test sample and use the remaining 80% to train it…
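An 80/20 split can be sketched in base R with sample() (a toy data frame here; in the practical the split is done on the joined ward data):

```r
set.seed(123)                               # reproducible split
df <- data.frame(x = 1:100, y = rnorm(100)) # hypothetical data
train_index <- sample(seq_len(nrow(df)), size = 0.8 * nrow(df))
trainingData <- df[train_index, ]           # 80% to fit the model
testData     <- df[-train_index, ]          # 20% held back for validation
nrow(trainingData); nrow(testData)
```

The model is then fitted on trainingData only, and testData is kept untouched until we assess the predictions.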

Now let’s build the model…

## 
## Call:
## lm(formula = Wardcount ~ Boroughcount, data = trainingData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -346.48  -43.31   -2.21   20.69  696.52 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -6.214418   7.141236   -0.87    0.385    
## Boroughcount  0.055760   0.002012   27.71   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 109.2 on 498 degrees of freedom
## Multiple R-squared:  0.6067, Adjusted R-squared:  0.6059 
## F-statistic: 768.1 on 1 and 498 DF,  p-value: < 2.2e-16
  1. We can just run a correlation between the data we left out and the predicted data to assess the accuracy…
## 
##  Pearson's product-moment correlation
## 
## data:  actuals_preds$actuals and actuals_preds$predicteds
## t = 14.477, df = 122, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7195673 0.8519176
## sample estimates:
##       cor 
## 0.7950186

This shows us that we have a strong and statistically significant relationship between our model output and the test data.

  1. We can also use min-max accuracy to see how close the actual and predicted values are…using the equation…

\[MinMaxAccuracy = mean\left( \frac{min(actuals, predicteds)}{max(actuals, predicteds)} \right)\]

## [1] 0.6108122
  1. This gives around 60% accuracy… another alternative is the mean absolute percentage deviation (MAPE), a statistical measure showing how accurate the prediction was… through the equation…

\[MAPE = mean\left( \frac{abs(predicteds - actuals)}{actuals} \right)\]

## [1] 0.8029353

Here we’ve got a very high MAPE — this is most likely because of the range of values of the wards for individual boroughs…
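Both measures are one-liners in base R, assuming a data frame actuals_preds with columns actuals and predicteds (toy values here for illustration):

```r
actuals_preds <- data.frame(actuals    = c(100, 200, 300),
                            predicteds = c(110, 180, 330))

# row-wise min over max, then averaged: closer to 1 is better
min_max <- mean(pmin(actuals_preds$actuals, actuals_preds$predicteds) /
                pmax(actuals_preds$actuals, actuals_preds$predicteds))

# mean absolute percentage deviation: closer to 0 is better
mape <- mean(abs(actuals_preds$predicteds - actuals_preds$actuals) /
             actuals_preds$actuals)

min_max; mape
```

Note the two measures point in opposite directions: min-max accuracy improves towards 1, MAPE improves towards 0.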

9.4.8 Cross validation

  1. Our validation approach so far was OK, but what if the test data was a different 20% — would it still hold up? We can test this by dividing all our data into blocks and then running the analysis several times, each time selecting a different block to be the test data…this is called cross validation …

  1. To start with we’ll remove the geometry, set up a data frame and replace any NAs with 0 — this is appropriate here, as if there are no Airbnbs within the ward the count should be 0.
  1. Now let’s change our training data… here I’ve got 5 iterations and then the whole lot is repeated 3 times. Search the literature for the right number to use; I think 10 iterations with no repeats is a good place to start.
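caret handles all of this for us, but the idea can be sketched in base R: cut the rows into k folds, hold each fold out in turn, fit on the rest, and average the error (toy data, 5 folds, no repeats):

```r
set.seed(42)
df <- data.frame(x = runif(100, 0, 10))     # hypothetical predictor
df$y <- 2 * df$x + rnorm(100)               # response with some noise
k <- 5
folds <- sample(rep(1:k, length.out = nrow(df)))  # random fold labels

rmse <- numeric(k)
for (i in 1:k) {
  fit  <- lm(y ~ x, data = df[folds != i, ])       # train on the other folds
  pred <- predict(fit, newdata = df[folds == i, ]) # predict the held-out fold
  rmse[i] <- sqrt(mean((df$y[folds == i] - pred)^2))
}
mean(rmse)                                   # cross-validated error
```

Repeated cross validation just reshuffles the fold labels and runs the whole thing again.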
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -346.30  -40.95   -3.44   20.51  709.70 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.88398    6.08645  -0.638    0.524    
## Boroughcount  0.05411    0.00174  31.103   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 104.8 on 623 degrees of freedom
## Multiple R-squared:  0.6083, Adjusted R-squared:  0.6076 
## F-statistic: 967.4 on 1 and 623 DF,  p-value: < 2.2e-16

Results show that our cross-validated model accounts for around 60% of the variance of the number of Airbnbs per ward… compare this to the results from the standard linear regression…

9.5 Advanced regression

9.5.1 More data

  1. We’re going to have a look at Ridge and LASSO regression next… however, they require us to have more than one predictor variable, so we’ll load a few in and demonstrate how to use them in normal regression first…

We’ll get our OSM data, project it, extract only hotels, join it with the boroughs, set any NAs to 0, join it with the wards, rename the new ‘freq’ column to Hotel count, remove the geometry and then set any more NAs to 0…

## Reading layer `gis_osm_pois_a_free_1' from data source `C:\Users\ucfnmac\OneDrive - University College London\Teaching\CASA0005repo\prac10_data\gis_osm_pois_a_free_1.shp' using driver `ESRI Shapefile'
## Simple feature collection with 35273 features and 4 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -0.5108706 ymin: 51.28117 xmax: 0.322123 ymax: 51.68948
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs

9.5.2 Multiple linear regression

  1. Re-run the model including Hotel count… you simply use a +
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -315.78  -40.08   -2.66   22.32  740.22 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -5.780607   6.192181  -0.934    0.351    
## Boroughcount  0.056349   0.002227  25.306   <2e-16 ***
## Hotelcount   -0.173119   0.107698  -1.607    0.108    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 104.7 on 622 degrees of freedom
## Multiple R-squared:  0.6099, Adjusted R-squared:  0.6086 
## F-statistic: 486.2 on 2 and 622 DF,  p-value: < 2.2e-16

The summary here shows that the hotel count predictor is not statistically significant, so we should remove it… how about we add some population and house price data… download the Excel data from here

  1. Now run the model again
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -502.13  -42.86   -2.51   28.75  667.22 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   -1.559e+02  2.439e+01  -6.392 3.21e-10 ***
## Boroughcount                   4.816e-02  2.019e-03  23.853  < 2e-16 ***
## Population...2015              8.927e-03  1.521e-03   5.870 7.09e-09 ***
## Median.House.Price.......2014  1.009e-04  1.921e-05   5.252 2.07e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 101.4 on 620 degrees of freedom
## Multiple R-squared:  0.6343, Adjusted R-squared:  0.6326 
## F-statistic: 358.5 on 3 and 620 DF,  p-value: < 2.2e-16

Our R squared and adjusted R squared have increased (albeit slightly) and all predictors are statistically significant.

Note that whilst I have selected variables pretty randomly here for demonstration, if you use something like this in practice it’s important to have supported (referenced) reasoning behind your selections.

9.5.3 Ridge regression

Ridge regression probably won’t fit our training data as well, because we introduce a small amount of bias to give us less overall variance… Basically, least squares regression tries to minimise how far each point is away from the line…

Whereas ridge minimises this plus the value of \(\lambda\) * the slope squared… Here, \(\lambda\) controls the shrinkage that regularises the coefficients towards 0. This is also a property of LASSO regression, which we will cover next.
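To see what the penalty does, here is a closed-form ridge fit in base R on toy data. With one predictor the estimate is \((X^TX + \lambda I)^{-1}X^Ty\), so \(\lambda = 0\) reproduces least squares and a larger \(\lambda\) shrinks the slope (this is only a sketch; glmnet does the real work in the practical, and a full implementation would leave the intercept unpenalised):

```r
set.seed(1)
x <- scale(rnorm(50, 10, 2))       # one standardised predictor
y <- 3 * x + rnorm(50)             # response with a known slope of 3
X <- cbind(1, x)                   # add the intercept column

# (X'X + lambda I)^-1 X'y: the ridge normal equations
ridge_coef <- function(X, y, lambda) {
  solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
}

b_ols   <- ridge_coef(X, y, 0)     # lambda = 0: matches lm()'s coefficients
b_ridge <- ridge_coef(X, y, 50)    # large lambda shrinks the slope towards 0
b_ols; b_ridge
```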

  1. Let’s see an example to show that if \(\lambda\) is 0 then we have our bog-standard linear regression, as \(\lambda\) * the slope squared will equal 0…
## 
## Call:
## lm(formula = Wardcount ~ Hotelcount + Boroughcount + Population...2015 + 
##     Median.House.Price.......2014, data = joined3df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -485.11  -42.49   -2.48   29.08  692.74 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   -1.542e+02  2.439e+01  -6.324 4.87e-10 ***
## Hotelcount                    -1.650e-01  1.076e-01  -1.533    0.126    
## Boroughcount                   5.007e-02  2.371e-03  21.119  < 2e-16 ***
## Population...2015              8.590e-03  1.535e-03   5.596 3.29e-08 ***
## Median.House.Price.......2014  1.049e-04  1.937e-05   5.417 8.69e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 101.3 on 619 degrees of freedom
## Multiple R-squared:  0.6357, Adjusted R-squared:  0.6334 
## F-statistic: 270.1 on 4 and 619 DF,  p-value: < 2.2e-16
## 5 x 1 sparse Matrix of class "dgCMatrix"
##                                           1
## (Intercept)                   -1.542465e+02
## Boroughcount                   5.006901e-02
## Hotelcount                    -1.649869e-01
## Population...2015              8.590007e-03
## Median.House.Price.......2014  1.049261e-04
  1. The coefficient estimates should be the same or very similar… now let’s try and use ridge regression to improve upon our least squares regression… first let’s show linear regression for comparison using the training data…
## [1] 8857.18
  1. Now we will find the best value for \(\lambda\) using cross validation — the lowest point of the curve is the optimal value that best minimises the error

Combined, this will fit a linear regression with a ridge regression penalty, using 10-fold cross validation to find the best \(\lambda\) value. Let’s apply it to the model

##              1
## [1,] 0.6496141
## [1] 7659.546

You can also do this in fewer steps with the ridge package

9.5.4 LASSO

Least absolute shrinkage and selection operator (LASSO) regression performs variable selection and regularisation to increase the prediction accuracy of the model… the only difference from ridge is that the regularisation term is an absolute value. So instead of squaring the slope (like I mentioned in ridge regression), we take its absolute value. As a result, coefficients are set to 0 if they are not relevant — this would remove features from the model for you.
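That ‘set to 0’ behaviour comes from the absolute-value penalty: its effect on a single coefficient is the soft-thresholding operator, which a base-R sketch makes concrete (glmnet applies this kind of shrinkage internally; the numbers below are hypothetical coefficient estimates):

```r
# shrink an estimate z towards 0 by lambda, cutting small values to exactly 0
soft_threshold <- function(z, lambda) {
  sign(z) * pmax(abs(z) - lambda, 0)
}

soft_threshold(c(-2, -0.3, 0.1, 1.5), lambda = 0.5)
# the large coefficients shrink by 0.5; the small ones become exactly 0
```

Compare this with ridge, which shrinks every coefficient but never sets any to exactly 0.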

##              1
## [1,] 0.6260367
## [1] 8573.853

9.5.5 Elastic net regression

Elastic net regression is where you vary the alpha value between 0 and 1 (ridge is alpha = 0, LASSO is alpha = 1). You would use elastic net regression to see if it was possible to reduce the mean square error any further. Have a look at this video to see how to use a for loop to test different values for alpha.

9.5.6 What’s the best here

Let’s compare the results of the MSE and R squared for the normal lm2, LASSO and Ridge models, as we used the same predictors… this is the code to get the values we don’t already have…

## [1] 0.6321809

Our lowest MSE and highest R squared value here is provided by ridge regression… so from this analysis I’d select ridge regression to predict any future values of Airbnbs within the London wards…

Warning The data used within this practical is purely for demonstration purposes!

9.6 What should I use

It depends; just because ridge regression seems to be the best for this application, it might not be for yours. It’s important to check the suitability of each method for both your data and analysis.